Winnowing, a Document Fingerprinting Algorithm
Abstract
Among digital data, documents are the easiest to copy and to strip of any embedded signatures or fingerprints, which makes their piracy the hardest to detect. Anyone can simply retype a document or copy part of it. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We present a literature study of Winnowing, a fingerprinting algorithm for documents. Winnowing selects fingerprints from the hashes of k-grams, where a k-gram is a contiguous substring of length k. We also work through a document fingerprinting example to show the performance of the algorithm.
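The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly described form of Winnowing, in which a window of w consecutive k-gram hashes slides over the document and the minimum hash in each window is kept, taking the rightmost one on ties. The window size w, the text normalisation, the use of truncated MD5 as the hash, and the function names kgram_hashes and winnow are all assumptions introduced for this example.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-gram (contiguous substring of length k) of the text."""
    # Assumption: ignore case, whitespace and punctuation before hashing.
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(cleaned) - k + 1):
        digest = hashlib.md5(cleaned[i:i + k].encode("utf-8")).digest()
        # Assumption: keep only the first 4 bytes as a compact integer hash.
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(hashes, w):
    """Return (position, hash) fingerprints chosen from windows of w hashes.

    In each window the minimum hash is selected; on ties the rightmost
    occurrence wins. A position is recorded only once even if it is the
    minimum of several overlapping windows. If there are fewer than w
    hashes, no fingerprint is selected.
    """
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


# Illustrative hash values (not computed from a real document), window size 4.
sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(sample, 4))
# prints [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Two documents can then be compared by intersecting the hash values of their fingerprint sets. Recording positions alongside the hashes makes it possible to locate where in each document a shared passage occurs.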
Similar resources
Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to authors' rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield reliable results, since they ignore the special features of each language. Considering the paucit...
High rate fingerprinting codes and the fingerprinting capacity
Including a unique code in each copy of a distributed document is an effective way of fighting intellectual piracy. Codes designed for this purpose that are secure against collusion attacks are called fingerprinting codes. In this paper we consider fingerprinting with the marking assumption and design codes that achieve much higher rates than previous constructions. We conjecture that these cod...
Near Duplicate Text Detection Using Frequency-Biased Signatures
As the use of electronic documents becomes more popular, people want to find documents that are completely or partially duplicated. In this paper, we propose a near duplicate text detection framework using signatures to save space and query time. We also propose a novel signature selection algorithm which uses the collection frequency of q-grams. We compare our algorithm with Winnowing, which is one of ...
Accurate discovery of co-derivative documents via duplicate text detection
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences,...
Chaffing and Winnowing: Confidentiality without Encryption
• Encryption: transforming the message to a ciphertext such that an adversary who overhears the ciphertext cannot determine the message sent. The legitimate receiver possesses a secret decryption key that allows him to reverse the encryption transformation and retrieve the message. The sender may have used the same key to encrypt the message (with symmetric encryption schemes) or used a differ...